feat: add replay support to runner group and fix replay duration overrun by JasonXuDeveloper · Pull Request #235 · Azure/kperf

JasonXuDeveloper · 2026-02-15T01:27:07Z

Summary

Runner group handler: Refactor `buildBatchJobObject` to support replay mode: skip configmap upload for replay, mount PVC for local profiles, set the `REPLAY_PROFILE_SOURCE` env var, and use the `/run_replay.sh` entrypoint
`run_replay.sh`: Entrypoint script for replay runner pods that invokes `kperf runner replay` and uploads results (with proper quoting and `--data-binary`)
Dockerfile: Copy `run_replay.sh` and `chmod +x` scripts
Fix duration overrun: Enforce a hard context deadline at `profile.Duration() + 30s` in both `Schedule` and `ScheduleSingleRunner` so a replay cannot run more than 30s past the profile duration
Context cancellation as success: Treat `context.Canceled`/`context.DeadlineExceeded` as success for all verbs. When the deadline fires, in-flight requests are cancelled intentionally (they are not failures), and this avoids the `ObserveFailure()` mutex thundering herd at shutdown
Remove unnecessary atomics: Change per-worker metrics from `int32`/atomic to plain `int` (each goroutine owns its instance)
Simplify WATCH goroutines: Fire-and-forget execution; remove redundant context check and concurrent metric writes
Worker formula: Update `recommendedWorkers` from `conns * 3` to a QPS-based calculation
Runner hardening: Guard against empty `restClis` in `startWorkers`
Scheduler hardening: Validate `runnerIndex` bounds in `ScheduleSingleRunner`
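
The "remove unnecessary atomics" item relies on each worker goroutine owning its own metrics instance. A minimal sketch of that pattern follows; the names `workerMetrics`, `runWorkers`, and `shouldFail` are hypothetical and are not taken from `replay/runner.go`. It assumes the totals are only read after all workers have finished.

```go
package main

import (
	"fmt"
	"sync"
)

// workerMetrics is a hypothetical per-worker counter set. Each instance is
// written by exactly one goroutine, so plain ints are sufficient.
type workerMetrics struct {
	requests int
	failures int
}

// runWorkers starts n workers that each process perWorker items and returns
// the merged totals. shouldFail is a stand-in for a real request outcome.
func runWorkers(n, perWorker int, shouldFail func(worker, i int) bool) workerMetrics {
	perWorkerMetrics := make([]workerMetrics, n)
	var wg sync.WaitGroup
	for w := 0; w < n; w++ {
		wg.Add(1)
		go func(w int) {
			defer wg.Done()
			m := &perWorkerMetrics[w] // owned by this goroutine only
			for i := 0; i < perWorker; i++ {
				m.requests++
				if shouldFail(w, i) {
					m.failures++
				}
			}
		}(w)
	}
	// wg.Wait provides the happens-before edge that makes the plain writes
	// above visible to the merge loop below.
	wg.Wait()

	var total workerMetrics
	for _, m := range perWorkerMetrics {
		total.requests += m.requests
		total.failures += m.failures
	}
	return total
}

func main() {
	total := runWorkers(4, 100, func(_, i int) bool { return i%10 == 0 })
	fmt.Printf("requests=%d failures=%d\n", total.requests, total.failures)
	// prints "requests=400 failures=40"
}
```

The sketch is only safe because nothing reads a worker's counters while that worker is still running; a live progress reporter would need atomics or a lock again.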

Test plan

`go build ./...` passes
`go vet ./...` passes
`go test ./replay/...` passes
End-to-end: run a 15-minute replay profile and confirm it completes in ~15 minutes

Part 6 of 6 in the replay feature PR stack. Depends on PR #234.

🤖 Generated with Claude Code

Copilot

Pull request overview

Adds “replay mode” execution to kperf, spanning runner-group deployment changes (indexed Jobs + replay entrypoint), a new replay engine/package (profile loading, partitioning, scheduling, runner), and CLI wiring for both local replay runs and distributed runner pods.

Changes:

Add replay profile types + loader, scheduler, runner, partitioning, and request builder under replay/.
Extend runnergroup deployment to support replay mode (skip configmap upload, mount PVC optionally, run /run_replay.sh in indexed Jobs).
Add CLI commands for replay (`kperf replay run` for local mode, `kperf runner replay` for runner pods) and refactor latency percentile reporting.

Reviewed changes

Copilot reviewed 26 out of 26 changed files in this pull request and generated 11 comments.

File	Description
testdata/sample_replay_runnergroup.yaml	Example RunnerGroupSpec for replay mode (URL/PVC profile sources).
testdata/sample_replay.yaml	Sample replay profile YAML with realistic request sequence.
scripts/run_replay.sh	New runner-pod entrypoint for replay mode + result upload loop.
runner/group/handler.go	Replay-aware job building (script, env, PVC mount) + skip configmap upload in replay mode.
replay/schedule_test.go	Tests for result aggregation and config warning validation behavior.
replay/schedule.go	Local multi-runner scheduler + single-runner entry for distributed mode + aggregation/warnings.
replay/runner_test.go	Unit tests/benchmarks for runner internals (bucket sizing, indexing).
replay/runner.go	Replay runner implementation (worker pool, WATCH handling, metrics).
replay/partition_test.go	Tests for deterministic partitioning and per-object ordering.
replay/partition.go	Partitioning logic (FNV-1a by object key) + distribution analysis helpers.
replay/loader_test.go	Tests for loading profiles from file/gzip and validation errors.
replay/loader.go	Profile loader supporting local paths and HTTP(S) + gzip auto-detection.
replay/builder_test.go	Tests for request building, masking, method mapping, query handling.
replay/builder.go	Request builder/executor using `rest.Interface` + masking for metric aggregation.
metrics/utils.go	New helper to build percentile latency reports (aggregate + per-URL) with optional raw data.
cmd/kperf/commands/runnergroup/run.go	Avoid clobbering `nodeAffinity` unless CLI flags are provided.
cmd/kperf/commands/runner/runner.go	Add `runner replay` subcommand + reuse percentile-report helper + config validation.
cmd/kperf/commands/root.go	Register top-level `replay` command.
cmd/kperf/commands/replay/run_test.go	Tests for local replay report building (with/without raw data).
cmd/kperf/commands/replay/run.go	Implement `kperf replay run` local-mode command and JSON reporting.
cmd/kperf/commands/replay/root.go	Add `replay` CLI root with `run` subcommand.
api/types/runner_group.go	Add replay fields to RunnerGroupSpec + `IsReplayMode()`.
api/types/replay_test.go	Tests for replay types validation and helpers.
api/types/replay.go	Define replay profile/request/spec types + validation + duration helper.
api/types/load_traffic.go	Fix typo in comment (“target”).
Dockerfile	Include `/run_replay.sh` and ensure scripts are executable.
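
The file summary describes `replay/partition.go` as partitioning by FNV-1a hash of the object key, with tests for determinism and per-object ordering. One way this could look is sketched below. The `request` type, its fields, and the `runnerFor` and `partition` names are assumptions made for illustration; the actual types in `replay/` may differ.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// request is a hypothetical, trimmed-down replay request.
type request struct {
	Seq       int    // position in the original profile
	ObjectKey string // e.g. "namespace/name" of the target object
}

// runnerFor maps an object key to a runner index using 32-bit FNV-1a.
// runners must be greater than zero.
func runnerFor(key string, runners int) int {
	h := fnv.New32a()
	_, _ = h.Write([]byte(key)) // hash.Hash writes never return an error
	return int(h.Sum32() % uint32(runners))
}

// partition assigns every request to exactly one runner. Requests for the
// same object key land on the same runner and keep their relative order,
// because the input is scanned once, front to back.
func partition(reqs []request, runners int) [][]request {
	if runners <= 0 {
		return nil
	}
	parts := make([][]request, runners)
	for _, r := range reqs {
		idx := runnerFor(r.ObjectKey, runners)
		parts[idx] = append(parts[idx], r)
	}
	return parts
}

func main() {
	reqs := []request{
		{Seq: 0, ObjectKey: "default/pod-a"},
		{Seq: 1, ObjectKey: "default/pod-b"},
		{Seq: 2, ObjectKey: "default/pod-a"},
		{Seq: 3, ObjectKey: "kube-system/cm-x"},
	}
	for i, p := range partition(reqs, 3) {
		fmt.Printf("runner %d: %v\n", i, p)
	}
}
```

Hashing by object key rather than round-robin keeps all operations on one object inside a single runner, which is what makes per-object ordering possible without cross-runner coordination.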

Copilot

Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.

Comments suppressed due to low confidence (2)

scripts/run_replay.sh:33

The log message `Uploaded it` is not very actionable when debugging uploads (it doesn’t include which file/runner or where it was uploaded). Consider logging the runner identity and/or target URL (and possibly the HTTP status) to make successful uploads traceable in pod logs.

      echo "Uploaded it"
      exit 0
      ;;

scripts/run_replay.sh:40

The message `Leaking pod? skip` is ambiguous/colloquial and makes it hard to understand the actual failure mode (404 from the result upload endpoint). Consider rewording to explicitly state that the runner is not recognized by the server (404) and that the pod is exiting as a result.

    404)
      echo "Leaking pod? skip"
      exit 1;

Distributed replay mode integration:

- Replay-aware job building: skip configmap upload for replay mode, use indexed Jobs for runner assignment, custom replay entrypoint script
- run_replay.sh: entrypoint script for replay runner pods that downloads the replay profile and invokes kperf runner replay
- Dockerfile: chmod +x for scripts directory

Signed-off-by: JasonXuDeveloper - 傑 <jason@xgamedev.net>

Copilot

Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated no new comments.

…ring herd

Treat context cancellation as success for all verbs (not just WATCH) to prevent mutex contention at shutdown. Remove unnecessary atomic operations on per-worker metrics and simplify WATCH goroutines to fire-and-forget. Update worker recommendation formula to be QPS-based.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Copilot

Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.

…Runner

Both Schedule and ScheduleSingleRunner were called with context.Background() and never enforced a hard deadline. The replay would run until every request completed naturally (up to 60s timeout each), causing 15-min profiles to run 20+ minutes.

Now both functions create a context.WithTimeout based on profile.Duration() plus a 30s grace period. When the deadline fires, in-flight requests get context.Canceled (treated as success per the previous commit), and WATCH connections are torn down immediately.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

JasonXuDeveloper force-pushed the replay/pr6 branch 2 times, most recently from f644ffd to f5a367b February 15, 2026 09:18

xinWeiWei24 requested a review from Copilot February 17, 2026 09:33

Copilot started reviewing on behalf of xinWeiWei24 February 17, 2026 09:34

Copilot AI reviewed Feb 17, 2026

JasonXuDeveloper force-pushed the replay/pr6 branch 18 times, most recently from dd4ffe3 to d03afca February 18, 2026 22:54

JasonXuDeveloper requested a review from Copilot February 18, 2026 22:56

Copilot started reviewing on behalf of JasonXuDeveloper February 18, 2026 22:56

JasonXuDeveloper changed the title ~~feat: add replay support to runner group and deployment infrastructure~~ feat: add replay mode to runner group deployment and harden runner/scheduler Feb 18, 2026

Copilot AI reviewed Feb 18, 2026

JasonXuDeveloper force-pushed the replay/pr6 branch from d03afca to 6c132ee February 18, 2026 23:05

JasonXuDeveloper requested a review from Copilot February 18, 2026 23:05

Copilot started reviewing on behalf of JasonXuDeveloper February 18, 2026 23:06

Copilot AI reviewed Feb 18, 2026

JasonXuDeveloper changed the title ~~feat: add replay mode to runner group deployment and harden runner/scheduler~~ feat: add replay support to runner group and fix replay duration overrun Feb 19, 2026

JasonXuDeveloper requested a review from Copilot February 19, 2026 00:08

Copilot started reviewing on behalf of JasonXuDeveloper February 19, 2026 00:09

Copilot AI reviewed Feb 19, 2026

JasonXuDeveloper wants to merge 3 commits into Azure:unstable-replay from JasonXuDeveloper:replay/pr6

2 participants
